Biological Pattern Discovery with R Machine Learning Approaches (Zheng Rong Yang)

bimodality is maximised. Suppose $\mathbf{X} = \mathbf{X}_A \cup \mathbf{X}_B$, where A and B stand for two classes, for instance, a class of non-cleaved peptides and a class of cleaved peptides. Thus a space can be decomposed as follows:

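$$\mathbf{X}^{t}\mathbf{w} = \mathbf{X}_A^{t}\mathbf{w} \cup \mathbf{X}_B^{t}\mathbf{w} = \hat{\mathbf{y}}_A \cup \hat{\mathbf{y}}_B \tag{3.4}$$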
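The decomposition in (3.4) can be illustrated with a small sketch. The data values, the class labels and the direction `w` below are hypothetical and chosen only for illustration; each row of `X` is assumed to hold one data point, so `X` plays the role of $\mathbf{X}^{t}$. The helper name `project` is also an assumption, not something taken from the text.

```python
# Toy data: each row is one data point with d = 2 features (hypothetical values).
X = [
    [1.0, 2.0],
    [1.5, 1.8],
    [5.0, 8.0],
    [6.0, 9.0],
]
labels = ["A", "A", "B", "B"]  # class membership of each row
w = [0.5, 0.5]                 # an arbitrary projection direction


def project(rows, w):
    """Return the projection (dot product with w) of every row."""
    return [sum(x_i * w_i for x_i, w_i in zip(row, w)) for row in rows]


# Project all data points at once ...
y_hat = project(X, w)

# ... or split the data by class first and project each class separately.
X_A = [row for row, c in zip(X, labels) if c == "A"]
X_B = [row for row, c in zip(X, labels) if c == "B"]
y_hat_A = project(X_A, w)
y_hat_B = project(X_B, w)

print(y_hat_A)
print(y_hat_B)
```

Taken together, the two class-wise lists contain exactly the same projections as `y_hat`, which is the content of the decomposition.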

$\hat{\mathbf{y}}_A$ and $\hat{\mathbf{y}}_B$ will form two densities. It is expected that a mixture density of $\hat{\mathbf{y}}_A$ and $\hat{\mathbf{y}}_B$ should be bimodal, meaning that the density of $\hat{\mathbf{y}}_A$ and the density of $\hat{\mathbf{y}}_B$ are well separated from each other, so that the two classes of data points can be discriminated, e.g., the two classes of peptides in protease cleavage pattern discovery. This requires the bimodality of the mixture density of $\hat{\mathbf{y}}_A$ and $\hat{\mathbf{y}}_B$ to be maximised. In other words, discriminant analysis needs to maximise the distance between the densities of $\hat{\mathbf{y}}_A$ and $\hat{\mathbf{y}}_B$ and to minimise the overlap between the two densities of $\hat{\mathbf{y}}_A$ and $\hat{\mathbf{y}}_B$ to generate an optimal LDA model. Unless these conditions are well satisfied, an LDA model will not work well.
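As a rough illustration of these two conditions, the sketch below scores how well two projected classes are separated for two candidate directions. The score used here (gap between class means divided by the average within-class standard deviation) is a simple stand-in chosen for illustration and is not claimed to be the criterion used in the text; the simulated data and the names `separation`, `good_w` and `poor_w` are likewise hypothetical.

```python
import random
import statistics

random.seed(0)

# Two hypothetical classes in 2-D that differ only along the first feature.
class_A = [[random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)] for _ in range(200)]
class_B = [[random.gauss(6.0, 1.0), random.gauss(0.0, 1.0)] for _ in range(200)]


def project(rows, w):
    """Return the projection (dot product with w) of every row."""
    return [sum(x_i * w_i for x_i, w_i in zip(row, w)) for row in rows]


def separation(y_a, y_b):
    """Gap between class means in units of the average within-class spread."""
    gap = abs(statistics.mean(y_a) - statistics.mean(y_b))
    spread = (statistics.stdev(y_a) + statistics.stdev(y_b)) / 2.0
    return gap / spread


good_w = [1.0, 0.0]  # aligned with the feature that separates the classes
poor_w = [0.0, 1.0]  # aligned with the feature the two classes share

score_good = separation(project(class_A, good_w), project(class_B, good_w))
score_poor = separation(project(class_A, poor_w), project(class_B, poor_w))
print(score_good, score_poor)
```

A large score corresponds to two well-separated densities with little overlap (a bimodal mixture), whereas a score near zero corresponds to two densities that sit on top of each other (a unimodal mixture).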

A linear discriminant analysis model is in the format shown in Equation (3.2). The values of $\hat{y}_n$ are called the projected data or the projections onto the projection direction, and they are continuous values. An LDA model is a parametric model because the projection direction is parameterised by $w_1, w_2, \ldots, w_d$.

The projection direction optimisation

Because the values of $\hat{y}_n$ are determined by $\mathbf{w}$, the density of $\hat{y}_n$ will vary with $\mathbf{w}$. Figure 3.1 shows a classification problem, where the classifier is defined by $\hat{y}_n = w_0 + w_1 x_{n1} + w_2 x_{n2}$. There are two scenarios of this classifier in this case: the model parameters $\mathbf{w}$ take different values in the two panels, hence two different models. The insets show two different density patterns of the projections $\hat{y}$. The histogram density of $\hat{y}$ shown in Figure 3.1(a) is bimodal. However, the histogram density of $\hat{y}$ shown in Figure 3.1(b) is unimodal. If the two panels (two models) of Figure 3.1 are compared, it can be seen that the key to a bimodal distribution of projections is to optimise the projection